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ABSTRACT 

Motivation: A bioinformatics platform is introduced aimed at 
identifying models of disease-specific pathways, as well as a set 
of network measures that can quantify changes in terms of global 
structure or single link disruptions.The approach integrates a network 
comparison framework with machine learning molecular profiling. 
The platform includes different tools combined in one Open Source 
pipeline, supporting reproducibility of the analysis. We describe 
here the computational pipeline and explore the main sources of 
variability that can affect the results, namely the classifier, the feature 
ranking/selection algorithm, the enrichment procedure, the inference 
method and the networks comparison function. 
Results: The proposed pipeline is tested on a microarray dataset 
of late stage Parkinsons' Disease patients together with healty 
controls. Choosing different machine learning approaches we get 
low pathway profiling overlapping in terms of common enriched 
elements. Nevertheless, they identify different but equally meaningful 
biological aspects of the same process, suggesting the integration of 
information across different methods as the best overall strategy. 
Availability: All the elements of the proposed pipeline are available 
as Open Source Software: availability details are provided in the main 
text. 

Contact: annalisa.barla@unige.it 



1 INTRODUCTION 

We present a computational framewor k for the study o f 
reproducibility in network medicine studies jBarabasi et aliuOl lh . 
Networks, molecular pathways in particular, are increasingly 
looked at as a better organized and more rich version of gene 
signatures. However, high variability can be injected by the different 
methods that are typically used in system biology to define a 
cellular wiring diagram at diverse levels of organization, from 
transcriptomics to signalling, of the functional design. For example, 
to identify the link between changes in graph structures and 
disease, we choose and combine in a workflow a classifier, the 
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feature ranking/selection algorithm, the enrichment procedure, the 
inference method and the networks comparison function. Each of 
these components is a potential source o f variability, as shown in 
the ca se of biomarkers from microarrays jThe MAOC Consortiuml 
I2010I) . Considerable efforts have been directed to tackle the 
problem of poor reproducibility of biomarker signatures derive d 
from high-throughput -omics data jThe MAOC ConsortiumLl2010h . 
addressing the issues of sele ction bias I Ambroise and McLachlanl 
l2002l ; lFurlanello et a/.Ll2003l) and more recently of pervasive batch 
effects jLeek et all |2010I> . We argue that it is now urgent to 
adopt a similar approach for network medicine studies. Stability 
(and thus rep roducibility) i n thi s class of studies is still an 



open problem jBaralla et al. 



5 



issue (Dc Smet and Marchal. 



2009) 



2010) 



Underdeterminacy is a major 
as the ratio between network 
dimension (number of nodes) and the number of available 
measurements to infer interactions plays a key role for the stability 
of the reconstructed structure. Furthermore, the most interesting 
applications are based on inferring networ ks topology and wiring 



from high-throughput noisy measurements jHe et ali 



2009) 



S hara n and Idekej . 



De spite its common use even in biological contexts i 
120061) , the problem of quantitatively comparing networks (e.g., 
using a metric instead of evaluating network properties) is a still 
an open issue in many scientific disciplines. The central problem 
is of course which network metrics should be used to evaluate 
stability, whether focusi ng on local change s or global structural 
changes. As discussed in ljurman et all j201 lh . the classic distances 
in the edit family focus only on the portions of the network 
interested by the differences in the presence/absence of matching 
links. Spectral distances - based on the list of eigenvalues of the 
Laplacian matrix of the underlying graph - are instead particularly 
effective f or studying global structure s. In particular, the Ipsen- 
Mikhailov jlpsen and Mikhailov , l2002h distance was found robust 
in a wide range of situations jjurman et a/.ll201ll) . However, global 
di stances can be tricke d by isomorph or close to isomorph graphs. 
In ljurman et al. l r 2012T) . both approaches are improved by proposing 
a glocal measure which combines a spectral distance with a typical 
Hamming local editing component. In this paper we use this new 
tool to quantify how stability of network reconstruction is modified 
in practice by the different inference and enrichment methods. 
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Pathway enrichment methods are widely used in bioinformatics 
analysis, for example to assess the relevance of biomarker lists or as 
a first step in network analysis. The enrichment step is performed 
using the functional information stored in the Kyoto Encyclopedia 



of Genes and Genomes (KEGG) database (Kanehisa and Goto, 



2000) and in the Gene Ontology (GO) database l lAshburner et al. 



20001) . The reconstruction of molecular pathways from high- 



throu ghput dat a is th en based on the th e ory of complex network s 
(e.g. Istrogatzl fcOOll); iNewmanl I200I); iBoccaletti e/oZl d2006h : 
iNewmanl d201C ): iBuchanan et al\ l l2010l) ) and, in particular, in 
the reconstructio n algorithms for inferring networks topology and 
wiring from data dHe et a/.Ll2009h . 

The stability analysis is applied to networks identifying common 
and specific network substructures that could be basically associated 
to disease. The overall goal of the pipeline is to identify and rank 
the pathways reflecting major changes between two conditions, 
or during a disease evolution. We start from a profiling phase 
based on classifiers and feature selection modules organized 
in terms of experimental proc edure by Data Analysis Protocol 
jThe MAOC Consortium!. |2010|) . obtaining a ranked list of genes 
with the highest discriminative power. Classification tasks as well 
as quantitative phenotype targets can be considered by using 
different machine learning methods in this first phase. Alternative 
methods are made available as components of one Open Source 
pipeline system. The problem of underdeterminacy in the inference 
procedure is avoided by focusing only on subnetworks, and the 
relevance of the studied pathways for the disease is judged in 
terms of discriminative relevance for the underlying classification 
problem. 

As a testbed for the glocal stability analysis, we compare network 
structures on a collection of microarray data for patients affected 
by Parkinson ' s disea se (PD), together with a cohort of controls 
fehang et all l2005bl) . PD is a neurodegenerative disorder that 
impairs the motor skills at the onset and the cognitive and the speech 
functions successively. The goal of this task is to identify the most 
relevant dirsupted pathways and genes in late stage PD. On this 
dataset, we show that choosing different profiling approaches we 
get low overlapping in terms of common enriched pathways found. 
Despite this variability, if we consider the most disrupted pathways, 
their glocal distances (between case and control networks) share a 
common distribution assessing equal meaningfulness to pathways 
found starting from different approaches. 



2 MATERIALS AND METHODS 

The machine l earning pipe l ine ad opted in this paper has been originally 
introduced in jBarla fl/.L l201lh . As shown in Figure [T] it handles 
case/control transcription data through four main steps, from a profiling task 
to the identification of discriminant pathways. The pipeline is independent 
from the algorithms used: here we describe each step and the implementation 
adopted for the foilowing experiments evaluating the impact of different 
enrichment methods in pathway profiling. 

Formally, we are given a collection of n subjects, each described by a 
p-dimensional vector x of measurements. Each sample is also associated 
with a phenotypical label, e.g. y = {1, —1}, assigning it to a class (in a 
classification task). The dataset is therefore represented by a nxp expression 
data matrix X, where p>n, and a corresponding labels vector Y. 
Feature Selection Step 

The matrix X is used to feed the profiling part of the pipeline within a 



proper Data Analysis Protocol, which w ill ensure accurate and reproducible 
results iThe MAOC Consortiuni |2010|) . The prediction model Ai is built 
by using two different algorithms for classification and feature ranking. 
The more recent one is the li£2 regularization with double optimization, 
capable of selecting subsets of discriminative genes. The algorithm can be 
tuned to give a minimal set of discriminative genes or larger sets including 
co rrelated genes and it is based on the optimization principle presented 
in IZou and Ffa stie 12005). The implementation used consists of two stages 
I De Mol et al. , 2009) and it is cast in nested loops of 10-fold cross-validation 
IBarla et all 120081) . The first stage identifies the minimal set of relevant 
variables (in terms of prediction error), while, starting from the minimal 
list, the second one selects the family of (almost completely) nested lists of 
relevant variables for increasing values of linear correlation. As alternative 
choice we consider Liblinear, a lineal' Support Vector Machine (SVM) 
classifier specifically desig ned for large datasets (millions of instances and 
features) jFan et all |2008|) . In particular, the classical dual optimization 
problem with L2-SVM loss function is solved with a coordinate descent 
method. For our experiment we adopt the ^-regularized penalty term and 
the module of the weights for ranking purposes within a 100 X 3-fold 
cross validation schema. We build a model for increasing feature sublists 
where the feature ranking is defined according to the importance for the 
classifier. We choose the model, and thus the top ranked features, providing 
a balance between the accuracy of the classifier and the stability of the 
signature iThe MAOC Co nsortium. 2010). Thus, the output of this first step 
is a gene signature g\ , (one for each model M ) containing the k most 
discriminant features, ranked according their frequency score. 
Pathway Enrichment 

The successive enrichment phase derives a list of relevant pathways from 
the discriminant features, moving the focus of the analysis from single 
genes to fu n ctiona lly related pathways. As outlined in the review by 
iHuang et al\ 120091) . in the last 10 years the gene-annotation enrichment 
analysis field has been growing ra pidly and se v eral b ioinformatics tools 
have been designed for this task. IHuang et all 120091) provide a unique 
categorization of these enrichment tools in three major categories based 
on the underlying algorithm: singular enrichment analysis (SEA), gene 
set enrichment analysis (GSEA), and modular enrichment analysis (MEA). 
We choose one representative £ for each class for our comparison 
referring as sources of information T> both to the KEGG, to explore 
known information on molecular interaction networks, and GO, to explore 
functional characterization and biological annotation. In the firs t category we 
choose WebGestalt (WG), an online gene set analysis toolkit jZhang et all 
l2005al) taking as input a list of relevant genes/probesets. The enrichment 
analysis is performed in KEGG and GO identifying the most relevant 
pathways and ontologies in the signatures. WG adopts the hypergeometric 
test to evaluate functional category enrichment an d performs a multiple test 
adjust ment (the default method is the one from iBeniamini and Hochberd 
fl995k The user may choose different significance levels and the minimum 
number of genes belong ing to the selected functional groups. GSEA 
ISubramanian et all |2005|) is our representative of the second class. It first 
performs a correlation analysis between the features and the phenotype 
obtaining a ranked list of features. Secondly it determines whether the 
members of given gene sets are randomly distributed in the ranked list of 
features obtained above, or primarily found at the top or bottom. We use 
the preranked analysis tool, feeding the ranked lists of genes produced by 
the profiling phase directly to the enrichment step of GSEA. To avoid a 
miscalculation of the enrichment score ES, we provide as input the complete 
list of variables (not just the selected ones), assigning to the not-selected 
a zero score. Note that GSEA calculates enrichment scores that reflect 
the degree to which a pathway is overrepresented at the top or the bottom 
of the ranked list. In our analysis we considered only pathways enriched 
with the top of the list. Finally, the tool in the MEA class is the Pathways 
and Literature Strainer (PaLS) dAlibes et all |2008|) . which takes a list or 
a set of lists of genes (or protein identifiers) and shows which ones share 
the same GO terms or KEGG pathways, following a criterion based on a 
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Fig. 1. General scheme of the analysis pipeline, with the indication of the algorithms and tools used in the PD dataset application (lower boxes). 



threshold t set by the user. The tool provides as output those functional 
groups that are shared at least by the t% of the selected genes. PaLS is aimed 
at easing the biological interpretation of results from studies of differential 
expression and gene selection, without assigning any statistical significance 
to the final output. Applying the above mentioned pathway enrichment 
techniques, we retrieve for each gene gi the corresponding whole pathway 
Pi = {hi, •••! ht}, where the genes hj ^ gi not necessarily belong to the 
original signature g\ , g^. Extending the analysis to all the h j genes of the 
pathway allows us to explore functional interactions that would otherwise get 
lost. 

Subnetwork Inference 

For each pathway, networks are inferred separately on data from the different 
classes. The subnetwork inference phase requires to reconstruct a network 
N PiiV on the pathway p; by using the steady state expression data of the 
samples of each class y. The network inference procedure is limited to the 
sole genes belonging to the pathway pi in order to avoid the problem of 
intrinsic underdeterminacy of the task. As an additional caution against this 
problem, in the following experiments we limit the analysis to pathways 
having more than 4 nodes and less than 1000 nodes. The pipeline allows 
to run analysis in parallel with different methods and thus to evaluate the 
variability along the whole pipeline. We adopt four different subnetwork 
reconstruction algori thms Af. the W eighted Gene Co-Expression Networks 
(WGCN) algorithm iHorvathl l201lh . the Algo rithm for the Reconstru ction 
of Accurate Cellular Networks (ARACNE) iMargolin et all |2006|) . the 
Context Likelihood of Relatedness (CLR) approach iFaith et aZtl2007l) . and 
the Reverse Engineering Gene Ne tworks using Artificial Neural Networks 
(RegnANN) iGrimaldi et alll201lh . In this work, we applied WGCNA, CLR 
and ARACNE to analyze the pathway identified in the Pathway Enrichment 
step, while RegnANN was used, as an alternative algorithm, to reconstruct 
interesting disrupted pathways and to compare its results with results from 
methods mentioned above. WGCNA is based on the idea of using (a function 
of) the absolute correlation between the expression of a couple of genes 
across the samples to define a link between them. ARACNE is a method 
for inferring networks from the transcrip t ion le vel ^Margolin et al l l200d) 
to the metabolic level iNemenman et all |2007|) . Beside it was originally 
designed for handling the complexity of regulatory networks in mammalian 
cells, it is able to address a wider range of network deconvolution problems. 
This information-theoretic algorithm removes the vast majority of indirect 
candidate interactions inferred b y co-expression methods b y using the data 
processing inequality property jCover and Thomasl Il99ll) . CLR belongs 
to the relevance networks class of algorithms and is employed for th e 
identification of transcriptional regulatory interactions jFaith et all l2007t) . 
In particular, interactions between transcription factors and gene targets are 
scored by using the mutual information between the corresponding gene 
expression levels coupled with an adaptive background correction step. 
Indeed the most probable regulator-target interactions are chosen comparing 
the mutual information score versus the "background" distribution of mutual 



information scores for all possible pairs within the corresponding network 
context (i.e. all the pairs including either the regulator or the target). 
RegnANN is a newly defined method for inferring gene regulatory networks 
based on an ensemble of feed-forward multilayer perceptrons. Correlation 
is used to define gene interactions. For each gene a one-to-many regressor 
is trained using the transcription data to learn the relationship between the 
gene and all the other genes of the network. The interaction among genes are 
estimated independently and the overall network is obtained by joining all 
the neighborhoods. Summarizing, we obtain a real-valued adjacency matrix 
as output of the subnetwork inference step for each dataset X, for each 
class y, for each model M, for each enrichment tool £, for each source 
of information T>, for each pathway pi , and for each subnetwork inference 
algorithm AT. We thus need to quantitatively evaluate network differences, 
i.e. using a metric instead of evaluating network properties. 
Subnetwork Distance and Stability 

Among the possible choices already available in literature, we focus on two 
of the most common distance families: the set of edit-like distances and the 
spectral distances. The functions in the former family quantitatively evaluate 
the differences between two networks (with the same number of nodes) 
in terms of minimum number of edit operations (with possibly different 
costs) transforming one network into the other, i.e. deletion and insertion 
of links, while spectral measures relies on functions of the eigenvalues of 
one of the connecti vity matrices of the underlying graph. As discussed in 
Ijurman et al\ feOl ll ). the drawback of many classical distances (such as those 
of the edit family) is locality, that is focusing only on the portions of the 
network interested by the differences in the presence/absence of matching 
links. Spectral distances can overcome this problem considering the global 
structure of the compared topologies. Within t hem, we consider the Ipsen- 
Mikhailov e distance: originally introduced in [jpsen and Mikhailovj 120021) 
as a tool for network reconstruction from its Laplacian spectru m, it has been 
proven to be the most robust in a wide range of situations by Ijurman et all 
<201 lh - We are also aware that spectral measures are not flawless: they 
cannot distinguish isomorphic or isospectral graphs, which can correspond 
to quite different conditions within the biologica context. We thus introduce 
the glocal distance </> as a possible solution against both issues: 4> is defined 
as the product metric of the Hamming distance H (as representative of the 
edit-familiy) and th e e distance. Full mathematical details are available in 
|jurmaneffl7U2012l) . 

Relying on the distances e and <f>, we evaluate networks corresponding to 
the same pathway for different classes, i.e. all the pairs (jV Pi +i, N Pi -i) 
and rank the pathways themselves from the most to the least changing across 
classes. 

Moreover, we attached to each network a quantitative measure of stability 
with respect to data subsampling, in order to evaluate the reliability of 
inferred topologies. In particular, for each N Pi . y , we extracted a random 
subsampling (of a fraction r of X labelled as y) on which the corresponding 
N PiyV will be reconstructed. Repeating m times the subsampling/inferring 
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Table 1. Summary of pathways retrieved in the 
pathway enrichment step. The numbers in brackets refer 
to the pathways considered for the network inference 
step. 



M 


V 


WG 


£ 

GSEA 


PaLS 


txh 


GO 


114(92) 


7(7) 


381 (331) 


KEGG 


43 (43) 


2(2) 


71 (71) 


Liblinear 


GO 
KEGG 


83 (45) 
56 (55) 


0(0) 
1(1) 


404 (356) 
77 (77) 



procedure, a set of m nets will be generated for each N PitV . Then all 
mutual (™) = m \™ — 11 distances are computed, and for each set of 
m graphs we build the corresponding distance histogram. In particular, 
for our experiments we set m = 20 and r = Mean and variance 
of the constructed histograms will quantitatively assess the stability of the 
subnetwork inferred from the whole dataset: the lower the values, the higher 
the stability in terms of robustness to data perturbation (subsampling). 
Data description and preprocessing 

The present e d appr oach is applied to PD data originally introduced in 
IZhang et al\ fe005bl) and publicly available at Gene Expression Omnibus 
(GEO), with accession number GSE20292. The biological samples consist 
of whole substantia nigra tissue in 11 PD patients and 18 healthy controls. 
Expressions were measured on Affymetrix HG-U133A platform. We 
perform the data normalization on the raw data with the rma algorithm 
of the R Bioconductor affy package with a custom CDF (downloaded 
from BrainArray: http://brainarray.mbni.med.umich.edu I adopting Entrez 
identifiers. 

Software Availability 

The Python implementation of I1I2 regularization with double optimization 
is available at http://slipguru.disi.unige.it/Software/LlL2Py Liblinear was 
originally developed by the Machine Learning Group at the National 
Taiwan University and it is now available within the Python mlpy 
library (http://mlpy.fbk.eul. We adopt the 12r_121oss_svc_dual 
solver, with C = 10e — 4 . WG is available as a web application at 
http://bioinfo.vanderbilt.edu/webgestalt GSEA is available either as a web 
application or a Java stand-alone tool at http://www.broadinstitute.org/gsea 
PaLS is available online at http://pals.bioinfo.cnio.es as a web application. 
For three of the network reconstruction algorithms, we adopted the R 
Bioconductor implementation: the WGCNA package for WGCN, and MiNET 
(Mutual Information NETworks package) for ARACNE and CLR. In 
particular, we set the WGCNA soft thresholding exponent to 5, while we 
keep the de fault value for the A RACNE data processing inequality tolerance 
parameter iMever et a/1 120081) . Moreover, the ARACNE implementation 
requires all the features to have non-zero variance on each class and for 
consistency purposes we applied this in all experiments. RegnANN is instead 
available from http://sourceforge.net/projects/regnann It is implemented in 
C and relies on GPGPU programming paradigm for improving efficiency. 
The glocal distance tf> is available upon request to the authors either as R 
script or Python script. The computation of the Ipsen-Mikhailov distance e 
is included as component of the glocal script. 

3 RESULTS AND DISCUSSION 

The feature selection results varied accordingly to the chosen 
method: £\l2 identified 458 discriminant genes associated to an 
average prediction performance of 80.8%, while with Liblinear we 
selected the top-500 genes associated to an accuracy of 80% (95% 



Table 2. Summary of common most disrupted 
pathways (<j> > 0.05). 



M 


V 


|n(Ali £ )| 


|n(£wG, PaLs)l 




GO 





17 


KEGG 


1 


22 


Liblinear 


GO 
KEGG 






5 

21 



boostrap Confidence Interval: (0.78;0.83)) coupled with a stability 
of 0.70. The lists have 119 common genes. 

The number of enriched pathways greatly varied depending on 
the selection and enrichment tools. With I1I2, we found globally for 
GO and KEGG, 157, 452 and 9 pathways as significantly enriched, 
for WG, PaLS and GSEA respectively. Similarly, for Liblinear, 
the identified pathways were: 139, 481 and 1. Table [7] reports the 
detailed results for model M, enrichment £ and database T>. 

If we consider the £\li selection method and the enrichment 
performed within the GO, we may note that no common GO terms 
were selected across enrichment methods. A significant overlap 
of results was found only between WG and PaLS, with 30 GO 
common terms. Similar considerations may be drawn with the 
results from the Liblinear feature selection method. Within the GO 
enrichment we did not identify any common GO term among the 
three enrichment tools. Considering only WG and PaLS, we were 
able to select 12 common GO terms. 

If we consider the l\l2 selection method and the enrichment 
performed within KEGG, two common pathways are identified 
across enrichment methods. A significant overlap of results was 
found between WG and PaLS, with 43 common pathways. For 
Liblinear, only one common pathway was selected among the three 
enrichment tools. A significant overlap of results was found between 
WG and PaLS, with 55 common pathways. 

Following the pipeline, we also performed a comparison of the 
three network reconstruction methods. We considered the most 
disrupted networks, keeping for the analysis those pathways that had 
a glocal distance greater or equal to the chosen threshold r = 0.05. 
The choice of such threshold was made considering the distribution 
of glocal distances <j> for the methods A4. For instance, if we 
consider the Liblinear selection method and the KEGG database, 
we have a cumulative distribution as depicted in Figure [2]a). The 
threshold r is set to 0.05 and allows retaining at least 50% of 
pathways. The plot in Figure [2jb) represents the glocal distances 
distribution for all enrichment methods £ with respect to the two 
components of the glocal distance: the Ipsen distance e and the 
Hamming distance H. The red curved line represents the threshold 
t in this space. The plot in Figure |2]c) is detailed for subnetwork 
inference method N ■ 

After retaining the most distant pathways, we performed a 
comparison of common terms for fixed selection method M and 
database T>. The results are reported in Table [2] In Tables [3] and [4] 
we report the most disrupted GO terms and KEGG pathways that 
have a glocal distance <fi greater or equal to the chosen threshold r. 

As an example of a selected pathway within KEGG, the networks 
(thresholded at edge weight 0.1 for graphic purposes) inferred 
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Table 3. Summary of most disrupted GO terms common between WG and PaLS, for different models M . Each GO term is associated to a glocal distance 
<j> > 0.05 for all subnetwork reconstruction algorithms M. GO terms are sorted according decreasing average </>. Bold fonts represent the GO terms shared by 
model M . 









Liblinear 


ID 


Tprm namp 

1 LI 111 1UU1IL 


ID 


Tpini n n m p 

1 L 1 111 1UM1IL 


GO:0005739 


Mitochondrion 


GO: 0042 127 


Regulation of cell proliferation 


GO:0031966 


Mitochondrial membrane 


GO:0005783 


Endoplasmic reticulum 


GO:0005743 


Mitochondrial inner membrane 


GO:0015629 


Actin cytoskeleton 


GO:0042802 


Identical protein binding 


GO:0006469 


Negative regulation of protein kinase activity 


GO:0007018 


Microtubule-based movement 


GO:0005747 


Mitochondrial respiratory chain complex I 


GO:0046961 


Proton-transporting ATPase activity, rotational mechanism 






GO:0005753 


Mitochondrial proton-transporting ATP synthase complex 






GO:0000502 


Proteasome complex 






GO:0015986 


ATP synthesis coupled proton transport 






GO:0045202 


Synapse 






GO:0048487 


Beta-tubulin binding 






GO:0042734 


Presynaptic membrane 






GO:0005747 


Mitochondrial respiratory chain complex I 






GO:0006120 


Mitochondrial electron transport, NADH to ubiquinone 






GO:0015078 


Hydrogen ion transmembrane transporter activity 






GO:0015992 


Proton transport 






GO:0005874 


Microtubule 
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Fig. 3. (a) Networks inferred by WGCNA algorithm for the ALS KEGG pathway for PD patients (above) and controls (below), on the same pathway for 
different inference algorithm, (b) WGCNA is the method showing the highest stability on the two classes, (c) Same pathway reconstructed with RegnANN. 



by WGCNA (together with the corresponding stability) on the 
Amyotrophic Lateral Sclerosis KEGG pathway (ALS - 05014) are 
displayed in Figure [3] We also plot the inferred network by the 
RegnANN algorithm. Similarly, in Figure [4] we plot the Pathogenic 
E. coli infection KEGG pathway, reconstructed by WGCNA, its 
stability plot, and the corresponding inferred networks by the 
RegnANN algorithm. 
Discussion 

The variability in the results, as expected, strongly depends on 
the method of choice. For feature selection, the nature of the 
method is key. In the proposed pipeline we limited the impact 



of this step by choosing two approaches within the regularization 
methods family. Both classifiers adopt a I2 -regularization penalty 
term, combined with different loss functions and, for £1^2 with 
another regularization term. We used similar but not equal model 
selection protocols. Both guarantee that the results are not affected 
by selection-bias. In this work, the main source of variability was the 
choice of the gene enrichment module. Therefore, the experimenter 
must be careful in choosing one method or another and in using 
it compliantly with the experimental design. For instance, GSEA 
was designed for estimating the significance levels by considering 
separately the positively and negatively scoring gene sets within a 
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Table 4. Summary of most disrupted KEGG pathways common between WG and PaLS, for different models M. Each pathway is associated to a glocal 
distance <f> > 0.05 for all subnetwork reconstruction algorithms A/", KEGG pathways are sorted according decreasing average <j>. Bold fonts represent the 
KEGG pathways shared by model M . 



Liblinear 



ID 


Pathway name 


ID 


Pathway name 


01100 


Metabolic pathway 


04630 


Jak-STAT signaling pathway 


05130 


Pathogenic Escherichia coli infection 


01100 


Metabolic pathway 


04910 


Insulin signaling pathway 


05130 


Pathogenic Escherichia coli infection 


00310 


Lysine degradation 


04623 


Cytosolic DNA-sensing pathway 


04140 


Regulation of autophagy 


00330 


Arginine and proline metabolism 


03050 


Proteasome 


04910 


Insulin signaling pathway 


00230 


Purine metabolism 


05212 


Pancreatic cancer 


05014 


Amyotrophic lateral sclerosis* 


03030 


DNA replication 


00980 
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*Note: This is the only selected pathway shared across all enrichment methods £. 



list of genes selected with filter methods based on classical statistical 
tests. It is worth noting that, if one uses the preranked option, 
as we did, negative regulated groups might not be significant at 
all (we indeed discarded them). WG uses the Hypergeometrical 
test to assess the functional groups but, differently from GSEA, 
does not use any significance assessment based on permutation 
of phenotype labels. PaLS is the simplest methods, being just a 
measure of occurrences of a given descriptor in the list of selected 
genes. However, enrichment methods from different categories are 
complementary and can identify different but equally meaningful 
biological aspects of the same process. Thus, the integration of 
information across different methods is the best strategy. 

Moreover, the assessment of the reconstruction distance between 
case and control version of the same pathways help in providing 
a quantitative focus on the key pathway involved in the process. 
The use of a distance mixing the effects of structural changes with 
those due to the differences in rewiring moreover warrants a more 
informative view on the difference assessment itself. The limited 
effect of different feature selection methods is confirmed by the 
plots in Figure[5] 

For the only most disrupted pathway shared by the three 
enrichment tools £ and the three reconstruction methods N is ALS. 
This pathway is relevant in this context because, like PD, ALS is 
another neurodegenerative disease therefore they share significant 



biological features in particular at the mithocondrial level. Moreover 
at the phenotypic level the skeletal muscles of the patients are 
severely affects influencing the movements. In Figure[3]it is evident 
that a high number of interactions are established among the genes 
going from the control (below) to the affected (above) pathways. 
It is also interesting to underline that CYCS (Entrez ID: 54205) 
one of the hub genes (represented by a red dot in the graph) 
within the pathway was identified by i\t% as discriminant. This 
gene is highly involved in several neurodegenerative diseases (e.g., 
PD, Alzheimer's, Huntington's) and in pathways related to cancer. 
Furthermore its protein is known to functions as a central component 
of the electron transport chain in mitochondria and to be involved 
in initiation of apoptosis, known cause of the neurons loss in 
PD. Across variable selection algorithms AL five highly disrupted 
pathways were found as shared between two of the three enrichment 
methods (see Table [4] bold items). In particular, we represented in 
Figure [4] the corresponding inferred networks. To further highlight 
the different outcomes occurring from the same dataset when 
diverse inference methods are employed, we reconstructed the ALS 
and Pathogenic E. coli infection by the RegnANN algorithm, which 
tends to spot also second order correlation among the network 
nodes, see Figures[3]and|4] 

Two genes in the E. coli infection pathway were selected both 
by txti and Liblinear, namely ABL1 (Entrez ID:71) and TUBB6 
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Fig. 4. (a) Networks inferred by WGCNA algorithm for the Pathogenic E. coli infection KEGG pathway for PD patients (above) and controls (below), on the 
same pathway for different inference algorithm, (b) WGCNA is the method showing the highest stability on the two classes, (c) Same pathway reconstructed 
with RegnANN. 



(Entrez ID: 84617). ABL1 seems to play a relevant role as hub 
both in the WGCNA and in the RegnANN networks. ABL1 is a 
protooncogene that encodes protein tyrosine kinase that has been 
implicated in processes of cell differentiation, cell division, cell 
adhesion, and stress response. It was also found to be responsible 
of apoposis in human brain microvascular endothelial cells. In 
Figure [6] we note that pathways with high number of genes are 
similar in term of local distance, instead a wider range of variability 
is found looking at the spectral distance. The red line in[6tb) divides 
the 2 cluster. Pathway targets beyond and within the red line are 
represented in the cumulative histogram inUJa). Pathways beyond 
the threshold are equally distributed and they represent a wider 
range of targets, instead pathways within the threshold show a 
smaller number of targets[6]a) on the right. 



4 CONCLUSION 

Moving from gene profiling towards pathway profiling can be an 
effective solution to overcome the problem of the poor overlapping 
in -omics signatures. Nonetheless, the path from translating a 
discriminant gene panel into a coherent set of functionally related 
gene sets includes a number of steps each contributing in injecting 
variability in the process. To reduce the overall impact of such 
variability, it is thus critical that, whenever possible, the correct 
tool for each single step is adopted, accurately focussing on the 
desired target to be investigated. This mainly holds for the choice 
of the most suitable enrichment tool and biological knowledge 
database, and, to a lower extent, to the inference method for 
the newtork reconstruction: all these ingredients are planned 
for different objectives, and their use on other situations may 
result misleading. As a final observation and a possible future 
development to explore, the emerging instability can be tackled by 
obtaining the functional groups identification as the result of a prior 
knowledg e injection in the learning ph ase, rather than a procedure a 
posteriori fevcinski et ^Zll201 lll2012h . 
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Fig. 2. Detailed distance plot for Liblinear and KEGG (see Figure[3p). The 
histogram plot in (a) represents the cumulative histogram for all distances 
across enrichment methods £ and subnetwork inference algorithms A/*. The 
threshold t is set to retain at least 50% of pathways Tn (h) a plof of 
Hamming vs. Ipsen distances (H vs. e) for all possible combinations of Q 
and AT, which is detailed in (c). 
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Fig. 6. (a) Pathway target cumulative histogram, (b) Hamming versus Ipsen 
(H vs. e) distance, and thresholding of high populated pathways. 
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